The wine quality dataset was created in 2009 by Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV), using red wine samples. The inputs are objective tests (e.g. pH values) and the output is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).
## [1] 1599 12
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
Histograms are a good way to see how each attribute is distributed on its own. The plots give a first overview of all the features.
Many of the variables look normally distributed. Chlorides, sulphates, alcohol, free sulfur dioxide and total sulfur dioxide look like they have lognormal distributions. Let’s exclude the values above the 95th percentile for these five features and re-plot their histograms:
The distributions for chlorides, sulphates, alcohol, free sulfur dioxide, and total sulfur dioxide look closer to normal after excluding the outliers.
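The trimming step described above can be sketched roughly as follows. This is only an illustration in base R: the `chlorides` vector is synthetic lognormal data standing in for the real column, and the names `cutoff` and `trimmed` are my own, not taken from the original analysis.

```r
# Synthetic, right-skewed stand-in for a column such as chlorides.
# These are NOT the real wine measurements.
set.seed(42)
chlorides <- rlnorm(1000, meanlog = log(0.08), sdlog = 0.4)

# Keep only the observations at or below the 95th percentile.
cutoff  <- quantile(chlorides, probs = 0.95)
trimmed <- chlorides[chlorides <= cutoff]

# Compare the full and trimmed distributions side by side.
op <- par(mfrow = c(1, 2))
hist(chlorides, breaks = 30, main = "All values", xlab = "chlorides")
hist(trimmed, breaks = 30, main = "Up to 95th percentile", xlab = "chlorides")
par(op)
```

Trimming removes the long right tail but also discards about 5% of the rows for that feature, so the trimmed histograms describe only the bulk of the data.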
Number of red wine instances: 1599. Number of attributes: 1 serial number + 11 attributes + 1 output attribute.
11 Attributes:
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Quality is the main feature.
Residual sugar, fixed acidity, pH, density and alcohol content may help support the investigation into the quality.
Yes, I do. Since the first column contains only serial numbers, it has no statistical significance. This column, named X, has been removed from the original dataset.
Chlorides, total sulfur dioxide, free sulfur dioxide, sulphates and alcohol all appeared to be long-tailed; they were log-transformed, which revealed an approximately normal distribution for each.
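To illustrate why a log transform helps with a long-tailed feature, here is a minimal sketch. It assumes a synthetic lognormal vector in place of the real `sulphates` column, and the `skewness` helper is a simple moment-based measure I define here for illustration; neither comes from the original analysis.

```r
# Synthetic, long-tailed stand-in for a column such as sulphates.
# These are NOT the real wine measurements.
set.seed(1)
sulphates <- rlnorm(1000, meanlog = log(0.62), sdlog = 0.4)

# Log-transform the values (base 10, as is common for plot axes).
log_sulphates <- log10(sulphates)

# Hypothetical helper: moment-based skewness, where 0 means symmetric.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# The raw values are clearly right-skewed; the transformed ones are not.
print(skewness(sulphates))
print(skewness(log_sulphates))
```

A skewness close to zero after the transform is consistent with the roughly bell-shaped histograms seen on the log scale.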
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
For quality, the main feature of the dataset, the positive correlation coefficients greater than 0.1 are:
alcohol:quality = 0.5
sulphates:quality = 0.3
citric.acid:quality = 0.2
fixed.acidity:quality = 0.1
So alcohol content has the strongest positive correlation with red wine quality. Other attributes positively correlated with red wine quality include sulphates, citric acid and fixed acidity.
As we can see from the above plot of alcohol content across quality, there is a large number of samples with a quality score of 5 and about 9.5% alcohol. The samples with a higher quality score also have a higher alcohol percentage.
For quality, the main feature of the dataset, the negative correlation coefficients less than -0.1 are:
volatile.acidity:quality = -0.4
total.sulfur.dioxide:quality = -0.2
density:quality = -0.2
chlorides:quality = -0.1
So volatile acidity is negatively correlated with red wine quality; as the attribute description notes, too high a level of it can lead to an unpleasant, vinegar taste. Total sulfur dioxide, density and chlorides are also negatively correlated with quality.
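Coefficients like the positive and negative ones listed above can be obtained with `cor()`, which computes Pearson correlations by default. The sketch below runs on a small synthetic data frame named `toy`, in which quality is constructed to rise with alcohol and fall with volatile acidity; the numbers it prints are therefore illustrative only and are not the values from the wine data.

```r
# Synthetic stand-in data; the relationships are built in by assumption.
set.seed(7)
n <- 500
toy <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18)
)
toy$quality <- round(
  5.6 + 0.4 * (toy$alcohol - 10.4) -
    1.5 * (toy$volatile.acidity - 0.53) + rnorm(n, sd = 0.6)
)

# Pearson correlation of every feature with quality.
cors <- cor(toy)[, "quality"]
cors <- cors[names(cors) != "quality"]

# Sort from most positive to most negative and round for readability.
print(round(sort(cors, decreasing = TRUE), 1))
```

Filtering the resulting vector with `cors[abs(cors) > 0.1]` would keep only the features that pass the 0.1 threshold used in the lists.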
Besides these, the attribute pairs with the strongest (positive or negative) correlations are:
fixed.acidity:pH = -0.7
fixed.acidity:citric.acid = 0.7
fixed.acidity:density = 0.7
free.sulfur.dioxide:total.sulfur.dioxide = 0.7
volatile.acidity:citric.acid = -0.6
citric.acid:pH = -0.5
density:alcohol = -0.5
As we all know, the stronger the acid, the lower the pH. So it makes sense that both fixed acidity and citric acid have a strong negative correlation with pH. All three features are acids, and I expected all of them to lower the pH. However, the above plot of pH across volatile acidity shows that pH increases slightly as volatile acidity increases. From this set of plots, volatile acidity appears to be the weakest of the three acids, and fixed acidity the strongest.
I will now look at several of the other strongest correlations in a bit more detail.
Acids play a large role in winemaking, and each acid plays a different part. I would like to see how the three kinds of acidity relate to the quality of red wine.

Fixed acidity is a background player, supporting and stabilizing the wine as it evolves. So an increase in fixed acidity does not affect the wine’s quality much, though it helps a little.

Volatile acidity is mainly acetic acid, as the attribute description notes, and at too high a level it can lead to an unpleasant, vinegar taste. (Malic acid, a harsher acid that is high prior to veraison and is used up through respiration as grapes ripen, so that cooler climates produce grapes with more of it, is not volatile; it belongs to the fixed acids.) As the plot shows, the more volatile acidity the wine contains, the lower its quality.

Citric acid is present in much smaller amounts than the other two, as the x-axis ranges of the three plots above show. The data points in the plot of quality against fixed acidity are concentrated between 6 and 10, which is almost 14 times the corresponding values for volatile acidity and almost 32 times those for citric acid.
The density of a wine partly decides whether it tastes thick or refreshing. According to the attribute description, the density of wine is close to that of water, depending on the percent alcohol and sugar content. I would like to find out how alcohol and sugar content affect density, and how density relates to quality. The density of wine is clearly affected by alcohol and residual sugar: the wine is more refreshing with more alcohol and thicker with more sugar. When a wine has too much acidity, winemakers who do not want to remove the acid can balance it with residual sugar, which may balance the taste of the wine. As we can see from the plot of quality against density, quality decreases as the wine becomes thicker.
Volatile acidity has a large negative Pearson correlation coefficient with quality, so I would like to see in more detail how the two relate. The above boxplot makes the result easy to see: high volatile acidity does lower the wine quality score.
At the end of the bivariate analysis, I would like to re-focus on the main feature of the dataset, quality. This boxplot shows one of the strongest relationships, the one between quality and alcohol.
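A boxplot of a feature grouped by quality can be sketched in base R as below. The data frame `toy` is synthetic, with an upward trend in alcohol across quality levels built in purely for illustration, so the plot only mimics the shape of the real one; the original figure may well have been drawn with a different plotting package.

```r
# Synthetic stand-in data with an assumed upward trend in alcohol.
set.seed(11)
quality <- rep(3:8, each = 40)
alcohol <- 9 + 0.5 * (quality - 3) + rnorm(length(quality), sd = 0.8)
toy <- data.frame(quality = factor(quality), alcohol = alcohol)

# One box per quality level; boxplot() also returns the box statistics.
bp <- boxplot(alcohol ~ quality, data = toy,
              xlab = "quality", ylab = "alcohol (% by volume)")

# Median alcohol per quality level, matching the line inside each box.
medians <- tapply(toy$alcohol, toy$quality, median)
print(round(medians, 2))
```

Reading the medians next to the plot makes it easier to say how far each box is shifted relative to its neighbours.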
As for quality, it appears that when alcohol or sulphates are present in higher amounts, the quality is also better. However, the amount of volatile acidity is negatively correlated with quality. It is likely that fresher wines avoid the bitter taste of acetic acid.
As for citric acid, fixed acidity is positively correlated with it, while volatile acidity is negatively correlated. As for density, fixed acidity is positively correlated with it, while alcohol is negatively correlated.
Among the variables analyzed, the strongest relationship was between fixed.acidity and pH, with a correlation coefficient of -0.68.
Now let’s visualize the relationship between density, volatile.acidity, alcohol and quality:

In the above scatter plot, darker blue means higher quality: white points have the lowest quality and the darkest blue points the highest. Most of the white points fall in the upper-left part of the canvas, while the bottom-right corner has more blue and dark blue points. This means that most of the wines with higher quality scores have higher alcohol content and lower volatile acidity.
The faceted plots below show how sulphates and alcohol affect the quality of the wine.

Interestingly, sulphates also slightly affect the quality of wine. Across the six plots above, the points shift slightly to the right. The change between two adjacent plots is barely noticeable, but comparing the first plot with the fifth or sixth shows a clear step to the right. So sulphates help a little to increase the wine quality score.
Next, let’s try to summarize quality using a contour plot of volatile acidity and sulphate content:

We can almost predict the result before plotting. Sulphates are positively correlated with quality, while volatile acidity is negatively correlated with it. So the contours for the highest quality scores should appear at higher sulphates and lower volatile acidity. The plot shows exactly what we expected.
This shows that higher quality red wines are generally located in the citric acid range from 0.25 to 0.65 and at higher alcohol, above 10.5%, whereas lower quality red wines generally have lower alcohol or lower citric acid.
Let’s try to summarize quality using a contour plot of density and alcohol content:

From the above plot, we can tell that density does not affect quality much, but alcohol does.
I am trying to use this plot to show the same result as the previous one. However, this latest plot conveys more information within a single plot.
Based on the multivariate analysis, five features stood out to me: alcohol, sulphates, citric acid, volatile acidity, and quality. Volatile acidity between 0.3 and 0.5 and sulphates between 0.6 and 0.9 were strong indicators of good wine. Also, high alcohol content and higher citric acid give a better chance of a good wine.
In analyzing the relationship between quality and the other 11 attributes, the strongest correlation coefficient was found between alcohol and quality.
## # A tibble: 6 x 2
## quality n
## <int> <int>
## 1 3 10
## 2 4 53
## 3 5 681
## 4 6 638
## 5 7 199
## 6 8 18
## wqr$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.575 11.000
## --------------------------------------------------------
## wqr$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## wqr$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## wqr$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## wqr$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## wqr$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
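Output of the kind shown above, with counts per quality level followed by per-level alcohol summaries, can be produced roughly as follows. Only the name `wqr` is taken from the printed output; the data here are synthetic, and base R's `table()` stands in for whatever produced the tibble (possibly dplyr's `count()`).

```r
# Synthetic stand-in for the wine data frame; NOT the real measurements.
set.seed(3)
wqr <- data.frame(
  quality = rep(3:8, length.out = 200),
  alcohol = round(runif(200, min = 8.4, max = 14.9), 1)
)

# Number of wines at each quality level.
quality_counts <- table(wqr$quality)
print(quality_counts)

# Six-number summary of alcohol within each quality level.
alcohol_by_quality <- by(wqr$alcohol, wqr$quality, summary)
print(alcohol_by_quality)
```

Because `by()` labels each group with the grouping expression, the printed blocks are headed `wqr$quality: 3`, `wqr$quality: 4`, and so on.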
Clearly, the box plots for higher quality red wines are shifted upward, meaning they have a comparatively higher alcohol content than the lower quality red wines.
Observe that lower sulphates content typically leads to a bad wine, with alcohol varying between 9% and 12%. Average wines have higher concentrations of sulphates; however, wines that are rated 6 tend to have both higher alcohol content and higher sulphates content. Excellent wines are mostly clustered around higher alcohol and higher sulphate contents.
This shows that higher quality red wines generally have a higher percentage of alcohol, more than 11%, and a slightly lower density, which suggests that refreshing wine is somewhat more popular. For the low quality levels, with scores of 3, 4 and 5, it is hard to tell how alcohol percentage affects quality, even with the help of density. For the high levels of 6, 7 and 8, it is obvious that more alcohol content results in better wine quality.
The red wine dataset contains information on 1,599 red wine instances, with 11 attributes and one output attribute. Initially, I tried to get a sense of how each attribute varies on its own, arranging all univariate plots together. Many of the variables look normally distributed, but some features have lognormal distributions. I excluded the values above the 95th percentile for these features and re-plotted their histograms.
Then I tried to find which factors might affect the quality of the wine. Here the Pearson correlation coefficient helps to quantify the relationship between each pair of variables. Using the insights from the correlation coefficients provided by the paired plots, it was interesting to explore quality using box plots with a different color for each quality level. Melting the dataframe and using facet grids was also really helpful for visualizing the distributions of the parameters with scatter plots. Finally, a contour plot of wine quality over a point plot of volatile acidity and alcohol is a good way to show that lower volatile acidity or higher alcohol makes a better wine more likely. The result makes sense. Volatile acidity, the amount of acetic acid in wine, is mostly caused by bacteria. It can lead to an unpleasant, vinegar taste at too high a level.
The hardest part for me was understanding all the features using Wikipedia and other documents. But to be a good data analyst, we must study and understand the data structure as much as we can. In the end, I figured out all the attributes of red wine.
The dataset could include more features describing the environment where the grapes are grown. As we all know, location and temperature play important roles in the quality of wine.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.